Fast 16x16 Multiply & Divide in 65802............John Butterill
                                                Ottawa, Ontario

Recently I needed a 16-bit multiplication subroutine in my 65802-enhanced Apple II.  Naturally, I needed one that was both fast and short.  I referred back to the Jan 86 AAL, which contained several examples for the 65802.  The one named FASTER caught my fancy because it seemed a good compromise between size and speed.  Then I made some changes which I think significantly improve it.

I noted that when you ROR the low half of the product into the multiplier, you get a bit out.  This bit remains in the carry.  If the low-product and the multiplier share the same location, then you can ROR in the low-product bit and ROR out the multiplier bit at the same time, instead of loading and LSR-ing the multiplier.  By not having to load the multiplier, the Accumulator is free to contain the high half of the product without saving and loading it each time around.  The result is rather more compact, fitting into 35 bytes (FASTER took 42 bytes).
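
As a rough illustration of the shared-location idea, here is a minimal C model.  It is a sketch, not the 65802 listing: the names mul16 and ror16 are hypothetical, the loop is assumed to be a rotate pair followed by a conditional add, and the closing assert is only a cross-check against the compiler's own multiply.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 16-bit ROR: rotate right through the carry flag. */
static uint16_t ror16(uint16_t value, unsigned *carry)
{
    unsigned bit_out = value & 1u;
    value = (uint16_t)((value >> 1) | (*carry << 15));
    *carry = bit_out;
    return value;
}

/* Hypothetical C model of the multiply (assumed loop shape). */
uint32_t mul16(uint16_t multiplicand, uint16_t multiplier)
{
    uint16_t acc = 0;              /* accumulator: high half of the product */
    uint16_t shared = multiplier;  /* multiplier AND low half of the product */
    unsigned carry = 0;

    for (int trip = 0; trip < 17; trip++) {
        acc = ror16(acc, &carry);        /* low bit of the high half -> carry */
        shared = ror16(shared, &carry);  /* product bit in, multiplier bit out */
        if (carry) {                     /* multiplier bit was 1: add */
            uint32_t sum = (uint32_t)acc + multiplicand;
            acc = (uint16_t)sum;
            carry = (unsigned)(sum >> 16);  /* carry out of the 16-bit add */
        }
    }

    uint32_t product = ((uint32_t)acc << 16) | shared;
    assert(product == (uint32_t)multiplicand * multiplier);  /* sanity check */
    return product;
}
```

Each rotate of the shared word does two jobs at once: the bit leaving the high half enters at the top, and the next multiplier bit falls out of the bottom into the carry, where it decides whether to add.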

It is also faster.  By my calculations, the best and worst cases take 335 and 383 cycles, respectively.  This includes the JSR to call the subroutine and the RTS to get back.

At the expense of two more bytes, I can save nine more cycles:  delete line 1240 and add the following:

       1304    ROR
       1305    ROR A

This avoids the 17th trip through the loop, whose only purpose was to roll in the final bit of the product.
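
To see why the 17th trip can be replaced by a single pair of rotates, here is a hedged C sketch.  The names mul16_16trips and rotate_right16 are hypothetical, and the loop shape (rotate pair, then conditional add) is an assumption that may not match the listing instruction for instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 16-bit ROR: rotate right through the carry flag. */
static uint16_t rotate_right16(uint16_t value, unsigned *carry)
{
    unsigned bit_out = value & 1u;
    value = (uint16_t)((value >> 1) | (*carry << 15));
    *carry = bit_out;
    return value;
}

/* Hypothetical model: 16 trips through the loop, then one rotate pair. */
uint32_t mul16_16trips(uint16_t multiplicand, uint16_t multiplier)
{
    uint16_t acc = 0;              /* accumulator: high half of the product */
    uint16_t shared = multiplier;  /* multiplier AND low half of the product */
    unsigned carry = 0;

    for (int trip = 0; trip < 16; trip++) {
        acc = rotate_right16(acc, &carry);
        shared = rotate_right16(shared, &carry);
        if (carry) {                        /* multiplier bit was 1: add */
            uint32_t sum = (uint32_t)acc + multiplicand;
            acc = (uint16_t)sum;
            carry = (unsigned)(sum >> 16);  /* carry out of the 16-bit add */
        }
    }

    /* All 16 multiplier bits are used up, so no further add can occur.
       Only the final product bit still has to be rolled into place. */
    acc = rotate_right16(acc, &carry);
    shared = rotate_right16(shared, &carry);

    uint32_t product = ((uint32_t)acc << 16) | shared;
    assert(product == (uint32_t)multiplicand * multiplier);  /* sanity check */
    return product;
}
```

On a 17th trip the bit rotated out of the shared word would be the zero that entered on the first trip, so the add could never happen.  Doing the two rotates once after the loop skips that trip's test and loop overhead.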

By the way, some assemblers use the syntax "ROR A" to rotate the contents of the A-register.  The S-C Macro Assembler and some others use the syntax "ROR" with a blank operand field for that mode.  Then "ROR A" means to rotate the contents of the variable named "A", as in my program.  To avoid confusion, you might want to change the variable names, avoiding the name "A".

<<<<listing of multiply subroutine>>>>

A 16-bit by 16-bit division seems inherently messier.  First, the divisor must be shifted left until it is greater than half the dividend.  One can do a fast cycle which shifts the divisor all the way to the left, but for every shift left in this loop, the divisor must be shifted right again in the second (subtracting) loop.

In practice, I feel that the values would not be randomly distributed, but would be biased toward smaller values.  I'm more likely to divide by 7 than by 32973, for example.  Therefore it is worthwhile putting in the extra code to shift left only as far as is necessary.  The scaling portion of my subroutine, lines 1240-1300, shifts the divisor until either bit 15 = 1 or the divisor equals/exceeds the dividend.

In the second loop, lines 1310-1400, the shifted divisor is repeatedly compared to what remains of the dividend.  If it is no larger, it is subtracted and a 1-bit goes into the quotient; otherwise a 0-bit goes in.  The loop stops after it has operated with the divisor shifted back to its original position.  This is ordinary long division, in binary.  The comparison-subtraction is performed from one to 16 times, depending on the values.
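
The two loops can be modeled in C as follows.  This is only a sketch of the method as described, not the 65802 subroutine: the names div16 and divmod16 are hypothetical, returning the remainder is an added convenience, and rejecting a zero divisor with an assert is an assumption, since the text does not say how that case is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type; the text does not say where results are left. */
struct divmod16 {
    uint16_t quotient;
    uint16_t remainder;
};

struct divmod16 div16(uint16_t dividend, uint16_t divisor)
{
    assert(divisor != 0);  /* assumption: zero divisor is not supported */

    /* Scaling loop: shift the divisor left only as far as necessary,
       stopping when bit 15 is set or the divisor reaches the dividend. */
    unsigned shifts = 0;
    while ((divisor & 0x8000u) == 0 && divisor < dividend) {
        divisor = (uint16_t)(divisor << 1);
        shifts++;
    }

    /* Subtracting loop: one compare-subtract per divisor position,
       ending with the divisor back in its original position. */
    uint16_t quotient = 0;
    for (;;) {
        quotient = (uint16_t)(quotient << 1);
        if (divisor <= dividend) {
            dividend = (uint16_t)(dividend - divisor);
            quotient |= 1u;          /* a 1-bit goes into the quotient */
        }
        if (shifts == 0)
            break;
        divisor >>= 1;
        shifts--;
    }

    struct divmod16 result = { quotient, dividend };
    return result;
}
```

For 100/7 the divisor is scaled four times (to 112), so the second loop makes five comparisons and produces quotient 14 with remainder 2.  For $FFFF/1 the divisor is scaled 15 times and the loop makes all 16 comparisons.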

As I calculate it, the best case (dividend=divisor) takes 82 cycles.  The worst case, which I think would be $FFFF/1, takes 676 cycles.  The time is a function of the number of significant bits in the answer.

<<<division subroutine>>>

[ John also wrote a nice demonstration driver for his subroutines, allowing you to enter two hexadecimal values and see the result in hexadecimal.  The source code for the demo is included on the monthly/quarterly disk. ]

